Nature Genetics
○ Springer Science and Business Media LLC
Preprints posted in the last 30 days, ranked by how well they match Nature Genetics's content profile, based on 240 papers previously published here. The average preprint has a 0.33% match score for this journal, so anything above that is already an above-average fit.
Hof, J. J. P.; Ning, C.; Quinn, L.; Speed, D.
Common complex diseases are clinically heterogeneous, yet most genome-wide association studies (GWAS) assume cases are genetically homogeneous. This challenge is compounded in large-scale biobanks, which increasingly combine cases ascertained under different recruitment strategies, raising concerns that heterogeneous case definitions may dilute genetic signal. To address this, we developed StratGWAS, a scalable framework that leverages clinical features of heterogeneity to construct a transformed phenotype that better reflects genetic liability within diseases. StratGWAS stratifies cases using secondary phenotypic information such as age of onset, medication burden, or recruitment definition. StratGWAS then estimates genetic covariance between strata, and derives a transformed phenotype that upweights cases with higher inferred genetic liability. Through simulation studies (N = 100k) and application to the UK Biobank (N = 368k), we show that StratGWAS consistently outperformed standard GWAS methods. Applied to 21 UK Biobank traits, StratGWAS upweighted individuals with earlier disease onset and higher medication burden, yielding 17% and 4% more independent genome-wide significant loci, respectively, than standard case-control GWAS. Applied to depression, StratGWAS upweighted individuals with multiple diagnoses, greater psychiatric comorbidity, or higher self-reported depressive symptoms, identifying eight additional independent loci compared to case-control GWAS.
Kyosaka, T.; Narita, A.; Kulski, J. K.; Minn, A. K. K.; Miyake, A.; Kotsar, Y.; Hiraide, K.; Ojima, T.; Nakatochi, M.; Namba, S.; Yamaji, T.; Sutoh, Y.; Sasaki, Y.; Broer, L.; Frost, F.; Koyanagi, Y. N.; Kasugai, Y.; Ito, H.; Sawada, N.; Nakano, S.; Suzuki, S.; Hishida, A.; Koyama, T.; Kubo, Y.; Funayama, T.; Makino, S.; Shirota, M.; Takayama, J.; Gocho, C.; Sugimoto, S.; Otsuka-Yamasaki, Y.; Tanno, K.; Abe, Y.; Nakajima, O.; Spaander, M. C. W.; Weiss, S.; Lerch, M. M.; Levy, D.; Hwang, S.-J.; Wood, A. C.; Rich, S. S.; Rotter, J. I.; Taylor, K. D.; Tracy, R. P.; Stocker, H.; Brenner, H.; Leja,
Helicobacter pylori (H. pylori) infects the gastric epithelium of approximately half of the global population, and is a well-known risk factor for developing gastric cancer. Despite the clinical significance of H. pylori infection, many genetic factors that contribute to susceptibility remain unidentified. While it is well-established that H. pylori infection can result in gastritis and peptic ulcers, which may progress to gastric cancer, its causal link to other diseases remains unclear. We performed a genome-wide association study (GWAS) for anti-H. pylori IgG antibody titers, which were validated as a surrogate marker for H. pylori infection by their correlation with clinical traits, followed by gene-based and pathway analyses, involving up to 140,863 individuals. This included 56,967 in the discovery phase and 68,211 in the replication phase from Japanese cohorts, and an additional 15,685 from European populations in a cross-ancestry meta-analysis. We reveal significant associations between H. pylori infection and polymorphisms in the Human Leukocyte Antigen (HLA) class II region within the Major Histocompatibility Complex (MHC), as well as genes related to innate immunity, including CCDC80, NFKBIZ, TIFA, PSCA, and TRAF3. Mendelian randomization (MR) analysis revealed that genetic liability to H. pylori infection has both positive and negative causal relationships with a variety of diseases, including autoimmune-related diseases such as Type 1 diabetes, Hashimoto's disease, and atopic dermatitis, as well as traits like body height and weight. These genetic findings strongly support the notion that genetic liability to H. pylori infection influences not only gastrointestinal diseases, but also a broader spectrum of health issues, thereby providing valuable insights for public health strategies and personalized medicine approaches.
Han, G.; Yuan, A.; Oware, K. D.; Wright, F.; Carroll, R. J.; Smith, M.; Ory, M. G.; Yan, D.; Wang, W.; Sun, Z.; Dai, Q.; Allen, C.; Dang, A.; Liu, Y.
Alzheimer's disease genomics and other high-dimensional omics studies demand powerful statistical methods, yet Bayesian inference remains underutilized despite its advantages in small-sample settings, owing to the prohibitive cost of eliciting reliable priors across thousands or millions of parameters. We propose an AI-assisted Bayesian-frequentist hybrid inference framework that couples large language model-based prior elicitation with the hybrid inference theory of Yuan (2009). ChatGPT-4o is queried via a standardized prompt to assess the strength of evidence linking each gene to a disease of interest, and the response is mapped to an informative normal prior via a standardized effect-size calibration. Parameters for covariates of secondary interest are treated as frequentist parameters, preserving efficiency and avoiding sensitivity to mis-specified priors. We derive closed-form hybrid estimators under uniform and conjugate normal priors in linear models, establish their asymptotic equivalence to the frequentist and full Bayes estimators, and show in simulations that hybrid inference using unconditional variance estimation leads to high statistical power while accurately controlling the Type I error rate. Applied to single-cell RNA sequencing data from the ROSMAP cohort for Alzheimer's disease as an example, the framework identifies biologically coherent pathways (such as gamma-secretase pathways) previously undetected. The proposed framework offers a principled and computationally scalable approach to genome-wide Bayesian analysis, with potential for broad application across omics platforms and disease settings.
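To make the role of an informative normal prior concrete, the sketch below shows the textbook normal-normal conjugate update, in which the posterior mean is a precision-weighted average of the prior mean and the observed estimate. This is a generic illustration only and is not the hybrid estimator derived in the preprint; the class `NormalPrior`, the function `posterior_normal`, and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NormalPrior:
    mean: float      # prior mean of the effect size (hypothetical value)
    variance: float  # prior variance; smaller means a more informative prior


def posterior_normal(prior: NormalPrior, estimate: float, std_error: float):
    """Combine a normal prior with a normally distributed estimate.

    Returns (posterior mean, posterior variance) using precision weighting.
    """
    prior_precision = 1.0 / prior.variance
    data_precision = 1.0 / std_error ** 2
    post_variance = 1.0 / (prior_precision + data_precision)
    post_mean = post_variance * (
        prior_precision * prior.mean + data_precision * estimate
    )
    return post_mean, post_variance


# Hypothetical gene: observed effect 0.2 with standard error 0.5.
# A diffuse prior barely moves the estimate; a tight prior dominates it.
weak = posterior_normal(NormalPrior(mean=0.5, variance=1.0), 0.2, 0.5)
strong = posterior_normal(NormalPrior(mean=0.5, variance=0.01), 0.2, 0.5)
print(weak, strong)
```

With a diffuse prior the posterior mean stays near the data, while a tight prior pulls it toward the prior mean, which is why the calibration of elicited priors matters in small samples.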
Kramer, B. K.; Kushner, S. A.; Rzhetsky, A.
Birth order has been implicated in the etiology of individual diseases, but has never been systematically assessed at phenome-wide scale with large administrative claims data and complementary epidemiological designs. Here we use two complementary approaches: a between-family matched cohort of 1.6 million pairs and a within-family sibling comparison that includes 5.1 million families and 10.3 million individuals. Both were applied to 569 diseases defined by ICD-9-CM/ICD-10-CM codes in the commercial claims data of Merative MarketScan. Of 418 diseases with adequate case counts, 150 show Bonferroni-significant birth-order associations. All odds ratios compare second-borns with first-borns, so OR < 1 indicates first-born excess. First-borns are at excess risk for neurodevelopmental conditions (autism OR = 0.74, ADHD OR = 0.93) and immune-allergic diseases consistent with the hygiene hypothesis (food allergy OR = 0.80, allergic rhinitis OR = 0.91), while second-borns are at excess risk for substance abuse (OR = 1.19) and gastrointestinal conditions. Between-family and within-family estimates agree in direction for 84.7% of significant diseases (Pearson r = 0.65), and results are robust to state fixed effects (r = 0.997) and full-sibling restriction. Prespecified validation controls were broadly consistent with expectations. These findings provide a comprehensive map of birth-order effects across the human disease phenome.
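The direction convention for the odds ratios above (second-born versus first-born, so OR < 1 means first-born excess) can be illustrated with a simple unmatched 2x2 table. This is a minimal sketch under stated assumptions: the counts are hypothetical, the function names are invented for illustration, and the preprint's matched and within-family analyses would use different estimators.

```python
import math


def odds_ratio(cases_second, controls_second, cases_first, controls_first):
    """Odds of disease in second-borns divided by odds in first-borns."""
    return (cases_second / controls_second) / (cases_first / controls_first)


def wald_interval(cases_second, controls_second, cases_first, controls_first,
                  z=1.96):
    """Approximate 95% Wald confidence interval on the odds-ratio scale."""
    log_or = math.log(
        odds_ratio(cases_second, controls_second, cases_first, controls_first)
    )
    se = math.sqrt(
        1 / cases_second + 1 / controls_second
        + 1 / cases_first + 1 / controls_first
    )
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


def direction(or_value):
    """Interpret an OR that compares second-borns with first-borns."""
    if or_value < 1:
        return "first-born excess"
    if or_value > 1:
        return "second-born excess"
    return "no difference"


# Hypothetical counts: 74 cases among 10,000 second-borns,
# 100 cases among 10,000 first-borns.
or_value = odds_ratio(74, 9926, 100, 9900)
low, high = wald_interval(74, 9926, 100, 9900)
print(round(or_value, 3), direction(or_value))
```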
Tueux, G.; Pouilly, N.; Bernigaud-Samatan, J.; Blanchet, N.; Boniface, M.-C.; Catrice, O.; Carrere, S.; Gouzy, J.; Jacquemot, M.-P.; Lauber, E.; Legendre, A.; Moreau, S.; Moroldo, M.; Roldan, A.; Carlier, A.; Langlade, N.
Nectar is a hub for plant-pollinator interactions, yet gene-level causal links between plant genetic variation, pollinator foraging and nectar microbial assembly remain poorly resolved. Using near-isogenic lines, innovative field time-lapse monitoring of pollinator visits and long-read amplicon sequencing of nectar microbiota, we show that a natural single-nucleotide variant at a cell-wall invertase gene (HaCWINV2) controls sunflower nectar chemistry and influences both pollinators and microbes. Plants homozygous for a loss-of-function HaCWINV2 allele produce sucrose-rich nectar, resulting in fewer bee visits under field conditions, while bumblebee visitation remains unaffected. In pollinator-excluded flowers, invertase-deficient plants harboured greater fungal diversity and compositionally distinct communities, indicating that nectar sugar profiles act as ecological filters shaping the nectar microbiome. This loss-of-function allele was found rarely and only in the heterozygous state in wild sunflowers and was fixed in 35% of cultivated lines, indicating positive selection during domestication. Our findings establish a causal link between a single gene and nectar chemistry with cascading ecological effects in a plant-pollinator system, illustrating how subtle genetic changes scale up to alter nectar traits, microbial assembly and pollinator foraging behaviour.
MacGregor, H. A. J.; Blundell, J. R.; Easton, D. F.
Pathogenic variants in TP53, the key tumour-suppressor gene underlying Li-Fraumeni syndrome (LFS), are among the best-established causes of inherited cancer predisposition. However, large-scale sequencing has revealed that many apparently pathogenic TP53 variants detected in blood are the result of somatic clonal expansions, complicating risk interpretation. Using blood-derived whole-exome data from 469,391 UK Biobank participants, we combined variant allele fraction (VAF) with haplotype-sharing analysis to distinguish germline and somatic TP53 variants. Germline variants were concentrated at sites linked to partial loss of p53 function and lower disease penetrance, whereas classic LFS alleles appeared almost entirely somatic. High-VAF carriers of classic LFS alleles had a markedly increased risk of haematological malignancy but not solid tumours, consistent with large TP53-mutant clonal expansions. The prevalence of somatic clonal expansion also correlated with missense variant pathogenicity, suggesting that somatic activity provides an informative in vivo proxy for functional impact. These results provide new insights into TP53-associated cancer risk at the population level, demonstrate that somatic rather than germline risk predominates in middle-aged healthy adults, and provide a scalable framework for variant classification in large-scale population genomics.
Fuhrer, J.; Shadrin, A. A.; Hughes, T.; Parker, N.; Hindley, G.; Frei, E.; Nguyen, D.; Smeland, O. B.; Djurovic, S.; Andreassen, O.; Dale, A.; Frei, O.
The genetic architecture of complex traits spans a continuum of polygenicity, yet it remains unclear how differences in polygenicity relate to the functional localization of SNP heritability across the genome. We use a MiXeR-based framework to partition heritability across exonic, intronic, and intergenic regions for 34 traits and introduce a likelihood-based annotation contribution score that quantifies annotation-specific impact on heritability. Exons explain a minority of heritability, and their contribution decreases with increasing polygenicity, from an average of 22% in less polygenic somatic diseases and biomarkers to 13% in highly polygenic psychiatric and cognitive phenotypes. Intergenic fractions show the opposite trend, whereas intronic fractions remain relatively stable. Analysis of a broader set of functional annotations reveals systematic differences along the polygenicity axis: highly polygenic traits show stronger contributions from comparative genomics and variant-effect scores, whereas less polygenic traits show stronger contributions in promoter, transcription, and chromatin annotations. Together, these results indicate that the functional partitioning of heritability systematically varies with polygenicity, pointing to a shift from gene-proximal regulatory architectures to architectures shaped by numerous dispersed regulatory effects as a key determinant of differences in polygenicity across traits.
Wang, Y.; Truong, B.; Lu, W.; Fadil, C.; He, Y.; Luo, W.; Koyama, S.; Tsuo, K.; Paruchuri, K.; Yu, Z.; Hull, L. E.; Zheng, Z.; Carey, C. E.; Walters, R. K.; Neale, B. M.; Robinson, E. B.; Kraft, P.; Natarajan, P.; Martin, A. R.
Polygenic scores (PGS) are typically derived from single-trait genome-wide association studies (GWAS), yet many complex diseases arise from shared genetic liability distributed across correlated clinical dimensions. Accordingly, disease risk depends not only on how genetic liability is represented but also on the social context in which that liability is expressed. Whether phenome-derived latent factors improve prediction, and how social determinants of health (SDoH) modify the realized utility of PGS, remains unclear. Here we constructed PGS for 35 orthogonal latent phenomic factors derived from 2,772 phenotypes in 361,114 UK Biobank (UKB) participants and evaluated their phenomic specificity, cross-dataset portability and predictive performance relative to conventional disease-specific PGS across the UKB holdout, Mass General Brigham Biobank and the All of Us (AoU) Research Program. Factor-based PGS showed widespread, biologically coherent phenome-wide associations that were reproducible across biobanks and ancestries. Their predictive utility, however, was strongly disease dependent. For asthma, a respiratory factor PGS outperformed an internally derived disease-specific PGS and showed superior cross-ancestry portability, retaining 41.5% of European-ancestry predictive accuracy in African-ancestry individuals, compared with 22.9% for an asthma PGS derived from the largest available multi-ancestry GWAS. By contrast, disease-specific PGS remained superior for coronary artery disease (CAD) and type 2 diabetes (T2D). These findings suggest that phenome-derived aggregation is most beneficial when disease-specific GWAS incompletely capture underlying liability, including settings of biological heterogeneity or imprecise phenotyping. We then evaluated SDoH in AoU as a complementary axis shaping prevalent disease prediction beyond genetic susceptibility. 
Across all three diseases, SDoH contributed substantial and largely independent predictive information beyond the disease-optimal genetic model. SDoH also modified how genetic liability translated into observed disease prevalence: for asthma and CAD, genetic stratification attenuated with increasing social burden, whereas this attenuation was substantially weaker for T2D. As a result, the same genetic percentile corresponded to different standardized predicted prevalences across social strata, reflecting disease-specific shifts in baseline prevalence, genetic gradients and calibration. Together, these findings indicate that disease risk is shaped by both genetic liability and the social context in which that liability is realized. Phenome-derived PGS improve prediction under specific architectural conditions, whereas social context independently modifies the performance, calibration and interpretation of genetic risk across populations.
Shi, Z.; Zhang, Z.; Mandla, R.; Hou, K.; Pasaniuc, B.
Polygenic scores (PGS) have emerged as a useful biomarker for stratification of high-risk individuals in genomic medicine, with prediction intervals arising as a principled approach to incorporate statistical uncertainty in their individual-level predictions. In contrast to recent reports by Xu et al. [7], we show that CalPred [6] provides well-calibrated prediction intervals that contain the trait phenotypes at targeted confidence levels. CalPred maintains calibration when PGS performance varies across contextual factors (e.g., ancestry, age, sex, or socio-economic factors), whereas PredInterval [7], a recently introduced method that focuses on marginal calibration across all individuals, exhibits miscalibration.
Yang, D.; Yang, Y.; Ray, N. R.; Li, M.; Benchek, P.; Crawford, D. C.; O'Toole, J. F.; Sedor, J. R.; Reitz, C.; Lynn, A.; Zhu, X.; Haines, J. L.; Alzheimer's Disease Genetics Consortium (ADGC); Bush, W. S.
Epidemiological studies have consistently shown that chronic kidney disease is associated with increased Alzheimer disease risk. However, the underlying genetic architecture connecting these two conditions remains largely unexplored beyond genome-wide correlation analyses. Here, we conducted the first comprehensive, multi-ancestry, large-scale genetic investigation to identify shared genetic components between kidney function and Alzheimer disease. We leveraged large-scale genome-wide association study summary statistics for estimated glomerular filtration rate (N ≈ 1.5 million European, N ≈ 145,000 African ancestry) and late-onset Alzheimer disease (N = 63,926 and N = 398,058 in two European cohorts; N = 9,168 in African ancestry) corrected for competing risk bias. We deployed a novel analytical framework integrating linkage disequilibrium score regression and polygenic risk score analysis, local analysis of [co]variant association, conjunctional false discovery rate analysis with Bayesian colocalization and fine-mapping, and bidirectional cis-Mendelian randomization to identify vertical pleiotropy. Despite the absence of genome-wide genetic correlation (rg ≈ 0, p > 0.1), local genetic analysis uncovered striking regional heterogeneity. Sixteen pleiotropic loci were identified in individuals of European ancestry (conjunctional false discovery rate < 0.05), including APOE, PICALM, SPI1, and EFTUD1, alongside 15 loci with significant local genetic correlations. Fine-mapping revealed that most pleiotropic loci harbored distinct causal variants for kidney function and Alzheimer disease, indicating horizontal pleiotropy. An APOE ε4-defining allele (rs429358) was the sole variant with shared causality across both traits. 
We identified vertical pleiotropy using cis-Mendelian randomization at the PICALM and EFTUD1 loci, providing evidence that kidney function-related genetic variants can causally affect Alzheimer disease risk at specific genomic loci. In contrast, loci such as CD2AP, MAT1A, and SYMPK demonstrated horizontal pleiotropy, reflecting shared upstream biological pathways rather than direct causal mediation. Notably, APOE was the only pleiotropic locus shared between European and African ancestry groups, underscoring marked ancestry-specific genetic architectures with critical implications for risk prediction and therapeutic translation. Alzheimer disease and kidney function share genetic components at specific loci rather than genome-wide, with mixed directional effects and horizontal pleiotropy explaining the absent global correlation despite strong local signals. At a subset of loci, we identified directional effects linking kidney genetic determinants to Alzheimer disease risk using cis-Mendelian randomization, supporting a complex kidney-brain genetic axis. Most overlap reflects horizontal pleiotropy, with limited loci showing vertical pleiotropy. APOE was the only shared locus across ancestries, underscoring ancestry-specific architectures with implications for risk prediction. The multi-scale approach used here also provides a methodological framework for dissecting complex disease relationships missed by traditional genome-wide analyses.
Brundage, D.
The domestic dog (Canis lupus familiaris) is a powerful model for genetic studies of complex disease, but canine genotype data are distributed across independent studies using incompatible genotyping platforms, genome builds, strand conventions, and allele coding schemes. Here we present CanVAS, a quality-controlled, harmonized, and imputed canine genotype resource integrating 15 publicly available datasets into a single analysis-ready PLINK file set on the CanFam4 (UU_Cfam_GSD_1.0) reference assembly. The typed backbone contains 15,451 dogs from over 375 breeds, village dog populations, dingoes, wolves, and coyotes, genotyped across 77,215 shared SNPs. Imputation against the Dog10K whole-genome sequencing reference panel (1,929 dogs) using Beagle 5.4 expanded the resource to 9.7 million variants (DR2 ≥ 0.3, MAF ≥ 0.01), including approximately 3 million rare variants (MAF < 0.05). We describe the complete harmonization pipeline and validate the resource through population structure analysis and genome-wide runs-of-homozygosity analysis, recovering known breed-level differences in genomic inbreeding.
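The post-imputation thresholds quoted above (DR2 ≥ 0.3, MAF ≥ 0.01, with "rare" meaning MAF < 0.05) can be expressed as a small filter. This is a sketch only: the `Variant` record, its field names, and the example values are hypothetical, and the actual CanVAS pipeline may apply these filters with different tooling.

```python
from dataclasses import dataclass

DR2_MIN = 0.3        # minimum imputation quality (Beagle dosage R-squared)
MAF_MIN = 0.01       # minimum minor allele frequency to retain a variant
RARE_MAF_MAX = 0.05  # retained variants below this MAF are counted as rare


@dataclass
class Variant:
    variant_id: str
    dr2: float
    alt_freq: float  # frequency of the alternate allele


def minor_allele_frequency(alt_freq: float) -> float:
    """MAF is the frequency of whichever allele is less common."""
    return min(alt_freq, 1.0 - alt_freq)


def passes_filters(v: Variant) -> bool:
    return v.dr2 >= DR2_MIN and minor_allele_frequency(v.alt_freq) >= MAF_MIN


def is_rare(v: Variant) -> bool:
    return passes_filters(v) and minor_allele_frequency(v.alt_freq) < RARE_MAF_MAX


# Hypothetical records, not taken from the resource.
variants = [
    Variant("v1", dr2=0.95, alt_freq=0.30),   # kept, common
    Variant("v2", dr2=0.25, alt_freq=0.20),   # dropped: low DR2
    Variant("v3", dr2=0.80, alt_freq=0.004),  # dropped: MAF below 0.01
    Variant("v4", dr2=0.60, alt_freq=0.97),   # kept, rare (MAF about 0.03)
]
kept = [v.variant_id for v in variants if passes_filters(v)]
rare = [v.variant_id for v in variants if is_rare(v)]
print(kept, rare)
```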
Gragert, L.; Madbouly, A.; Bashyal, P.; Wadsworth, K.; Kempenich, J.; Bolon, Y.-T.; Maiers, M.
The human leukocyte antigen (HLA) system is the primary determinant of donor selection in allogeneic hematopoietic cell transplantation (HCT) and plays a central role in solid organ transplantation, immune-mediated disease studies, evolutionary population genetics, and immunotherapy. Large-scale sampling of registry participants reflecting major US ancestry groups allows for characterization of the complex landscape of HLA haplotype diversity for the classical HLA class I (HLA-A, HLA-B, HLA-C) and HLA class II (HLA-DRB1, HLA-DRB3, HLA-DRB4, HLA-DRB5, HLA-DQA1, HLA-DQB1, HLA-DPA1, and HLA-DPB1) genes. Here we present nine-locus classical HLA allele and haplotype frequency estimates for five broad (Black, White, Asian or Pacific Islander, Hispanic and Native American) and 21 detailed US populations based on 9,671,082 donors with targeted genotyping by DNA-based methods. Frequency estimation used an expectation-maximization (EM) framework specifically adapted to handle mixed-resolution and ambiguous HLA genotyping data. Advancements in next-generation sequencing provide extensive HLA genotyping, offering new insights into the haplotype structure and diversity of the human MHC complex, expanding knowledge especially for HLA class II haplotypes. Population analyses reveal that the most common high-resolution haplotypes are predominantly population-specific, with only three haplotypes shared across the top-100 lists of all five broad population groups, and that Black populations exhibit the greatest nine-locus haplotypic diversity, a pattern that persists after controlling for differences in registry sample size. These frequencies, derived from the largest US cohort to date, support clinical decision-making and research in histocompatibility, immunogenetics, and transplantation and are publicly available at https://zenodo.org/records/17966993.
Yuan, H.; Mandava, A.; Sarmart, K.; Ganz, J.; Krishnan, A.
Genome-wide association studies (GWAS) have implicated thousands of loci in complex diseases, but translating these population-level signals into specific cellular contexts remains a central challenge. Integrating GWAS with single-cell transcriptomics data has enabled systematic identification of disease-relevant cell types, yet existing methods face a fundamental tradeoff: approaches like seismic that are optimized for statistical power operate at the annotated cell-type level and miss heterogeneous disease signals concentrated in specific cellular states, while single-cell-resolution approaches like scDRS that capture such heterogeneity often lack sufficient power to detect subtle associations. Here we present ICePop (Informative Cell Populations), a framework that resolves this tradeoff by performing disease-cell type association at metacell resolution, thus achieving statistical power comparable to cell-type-level methods while detecting heterogeneous disease signals within cell types. In simulations against seismic and scDRS, ICePop maintains appropriate false positive rates and demonstrates superior power when disease effects are concentrated in cellular subpopulations. Applied to Tabula Muris across 81 traits and 120 cell types, ICePop identifies 2,178 disease-cell type associations, including the preferential vulnerability of differentiated gut epithelial cells in ulcerative colitis and loss of cell identity in immune-stressed lung capillary endothelial cells underlying their association with lung function. Clustering diseases by metacell association profiles reveals groupings that diverge from genetic risk-based clustering, including separation of blood cell count traits from immune diseases despite shared genetic architecture, reflecting differences in cellular rather than genetic etiology. 
In autism spectrum disorder, ICePop identifies preferential enrichment of genetic risk in specific enteric neuron subtypes, implicating dysfunction of the enteric nervous system in gastrointestinal comorbidities. ICePop's resolution of disease-relevant cell states within annotated cell types enables generation of testable, cell-state-specific hypotheses about disease mechanisms and therapeutic targets.
Yang, C.; Zhang, X.; Chen, J.
Methods that map genetic risk to cells identify disease-relevant tissues and cell types but cannot test whether genetic effects concentrate at molecular interfaces between cells. Here we introduce EdgeMap, which integrates spatial transcriptomics with GWAS summary statistics to partition trait heritability into cell-intrinsic and intercellular components and to resolve the intercellular signal into specific ligand-receptor channels. Across 17 traits and five human tissues, edge heritability is enriched in biologically coherent trait-tissue pairings (3.8-fold; P = 4.4 × 10^-6) and replicates across independent tissue sections, GWAS cohorts, and cell-segmented Visium HD. Per-pair decomposition identifies 67 trait-specific channels (FDR < 0.10) organized into convergent pathway families: neurexin-neuroligin synaptic signaling in bipolar disorder, vascular adhesion in cardiovascular traits, and lipoprotein-clearance pathways in liver. Most edge genes are absent from standard gene-level prioritization, supporting intercellular communication as a complementary dimension of genetic architecture.
Small, A. M.; Yu, M.; Berrandou, T. E.; Georges, A.; Huff, M.; Morningstar, J. E.; Rand, S. A.; Koyama, S.; Lee, J.; Vy, H. M.; Farber-Eger, E.; Jin, S.; Dieterlen, M.-T.; Kontorovich, A. R.; Yang, T.-Y.; Do, R.; Dressen, M.; Krane, M.; Feirer, N.; Doppler, S. A.; Schunkert, H.; Trenkwalder, T.; Wells, Q. S.; Berger, K.; Ostrowski, S. R.; Sorensen, E.; Pedersen, O. B.; Bundgaard, J. S.; Ghouse, J.; Bundgaard, H.; Ganna, A.; Erikstrup, C.; Mikkelsen, C.; Bruun, M. T.; Aagaard, B.; Ullum, H.; Abner, E.; Slaugenhaupt, S. A.; Nadauld, L.; Knowlton, K.; Helgadottir, A.; Sveinbjornsson, G.; Gudbjart
Mitral valve prolapse (MVP) is the most common cause of primary mitral regurgitation and is associated with the development of malignant arrhythmias, often in the context of myocardial fibrosis. The genetic architecture of MVP, and whether there are genetic factors explaining why only some individuals with MVP have adverse outcomes, remain poorly understood. We performed a meta-analysis of genome-wide association studies (GWAS) for MVP encompassing 21,517 cases among a total sample size of over 2.2 million individuals. We discovered 89 genomic risk loci for MVP, of which 72 were novel findings. Prioritization of causal genes and pathways using epigenetic and transcriptomic data from mitral valve and extra-valvular tissues replicated known gene associations with MVP, including those involved in TGF-β signaling and extracellular matrix biology, but additionally emphasized a role in MVP for biological pathways relevant to cardiomyocyte biology. Accordingly, we identified several MVP risk loci with pleiotropy to cardiomyopathies, especially hypertrophic cardiomyopathy, and demonstrated a significant genetic correlation between MVP and hypertrophic cardiomyopathy. Finally, we interrogated snRNA-seq data in human papillary muscle tissue from two individuals with severe MVP, characterizing genes associated with both risk of papillary muscle fibrosis and MVP.
Carver, S.; Perea-Chamblee, T.; Taraszka, K.; Moon, I.; Yu, X.; Ding, Y.; Carrot-Zhang, J.; Gusev, A.
Genome-wide association studies (GWAS) have advanced the understanding of germline susceptibility in common cancers, yet rare malignancies remain underexplored due to limited sample sizes. To address this gap, we conducted large-scale GWAS across 20 rare cancer types and meta-analyzed results from three cohorts: two clinically sequenced cancer center cohorts and an independent population biobank, comprising over 480,000 individuals. We identified nine novel genome-wide significant susceptibility loci with moderate to large effect sizes that replicated across cohorts in eight rare malignancies, including myelodysplastic syndromes (MDS), germ cell tumors, gastrointestinal stromal tumor (GIST), gastrointestinal neuroendocrine tumors, anal cancer (ANSC), non-melanoma skin cancer, mesothelioma, and hepatobiliary cancer. Among the strongest associations were loci in MDS near API5 (OR = 2.21, p = 1.06 × 10^-8), in GIST near SLC6A18 and TERT (OR = 1.91, p = 8.20 × 10^-50), and in ANSC near HLA-DQA2 (OR = 1.58, p = 5.50 × 10^-18). The GIST risk variant was enriched in tumors harboring somatic KIT mutations (OR = 2.21, p = 6.5 × 10^-4) and was associated with worse survival among carriers with KIT-mutant tumors (hazard ratio = 4.06, p = 0.015), implicating germline-somatic interplay in tumor initiation and progression. The ANSC risk variant was associated with HPV infection (OR = 1.44, p = 3.19 × 10^-5), supporting a host-viral interaction in HPV-driven tumorigenesis. The MDS risk variant at the API5 locus was associated with altered neutrophil counts, suggesting a role in hematopoietic dysregulation in disease pathogenesis. We further identified novel, independent associations with mesothelioma, GIST, and hepatobiliary cancer at the 5p15.33 locus encompassing TERT, consistent with pleiotropic genetic effects at a core telomere-maintenance gene. 
Collectively, these findings demonstrate that integrating clinically ascertained sequencing cohorts with population biobanks substantially enhances germline discovery in rare cancers, enabling identification of high-confidence susceptibility loci and facilitating downstream biological interpretation through linked somatic, viral, and clinical data. This framework provides a scalable approach for characterizing inherited susceptibility across diverse rare malignancies.
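As an illustration of how per-cohort GWAS results are commonly pooled, the sketch below implements a fixed-effect inverse-variance-weighted meta-analysis of log odds ratios. The abstract does not state which meta-analysis model was used, so this choice is an assumption; the function name `fixed_effect_meta` and the three cohort estimates are hypothetical.

```python
import math


def fixed_effect_meta(log_ors, std_errors):
    """Inverse-variance-weighted pooling of per-cohort log odds ratios.

    Returns (pooled log OR, pooled standard error).
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, log_ors)) / total
    return pooled, math.sqrt(1.0 / total)


# Hypothetical estimates for one variant in three cohorts.
cohort_log_ors = [math.log(2.0), math.log(2.4), math.log(2.1)]
cohort_ses = [0.20, 0.30, 0.25]

pooled_log_or, pooled_se = fixed_effect_meta(cohort_log_ors, cohort_ses)
z_score = pooled_log_or / pooled_se
print(round(math.exp(pooled_log_or), 2), round(pooled_se, 3), round(z_score, 2))
```

More precise cohorts receive larger weights, and the pooled standard error is smaller than that of any single cohort, which is the source of the power gain from combining cohorts.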
Qiu, D.; Mao, Z.; He, J.; Xu, Z.; Liu, C.; Davtian, D.; Chen, Q.; Karaca, S.; 23andMe Research Team; Cabrera Mendoza, B.; Polimanti, R.
The extent of shared and disorder-specific etiology among generalized anxiety disorder (GAD), major depressive disorder (MDD), and posttraumatic stress disorder (PTSD) remains unclear. Leveraging multiple cohorts, we conducted a multivariate and multi-ancestry genome-wide association study of GAD (N = 1,358,762), MDD (N = 3,601,629), and PTSD (N = 1,617,876). We identified 248 loci associated with the latent internalizing disorder factor (INT), 591 with MDD, 237 with PTSD, and 109 with GAD. While GAD and PTSD genetic risk demonstrated strong overlap with the INT factor, 38% of MDD genetic signals were disorder-specific. Cross-population fine-mapping uncovered >450 causal variants, and the subsequent multi-omics characterization linked them to >1,250 genes, including both novel shared and disorder-specific loci. Considering the high-confidence findings converging across analytic approaches, we observed that the genetic liability shared across the internalizing spectrum is driven by broadly acting cellular and regulatory mechanisms, whereas disorder-specific genetic risk reflects more specialized perturbations in neurodevelopmental, synaptic, and stress-responsive pathways.
Jacobsen, J. T.; Moller, P. L.; Rohde, P. D.
Genomics offers a powerful approach to identify causal mechanisms underlying coronary artery disease (CAD) risk, with implications for pathogenesis, personalized prevention strategies, and therapeutic target discovery. Functionality-informed genome-wide fine mapping was performed using the Bayesian framework SBayesRC to estimate genetic contributions of 6.9 million common variants, based on GWAS summary statistics from over one million individuals of European ancestry. Causal candidate genes were prioritized in a 5 kb flanking window within high-confidence local credible sets (LCSs). Their downstream biological influence was analyzed using protein-protein interaction networks and pathway enrichment analyses across three complementary dimensions: molecular, cellular, and disease level. Genetic modeling captured the highly polygenic architecture of CAD, estimating that on average 34,000 variants contribute to CAD risk, explaining 3.8% of total phenotypic variance. Thirty-six high-confidence variants (PIP > 0.9) collectively explained 13.6% of genetic variance, while most variants demonstrated small individual effects but with substantial collective contributions. In total, 17,150 variants were prioritized within 581 high-confidence LCSs, of which 195 were annotated to genes and 170 were implicated in downstream pathway analyses. The three most influential variants were mapped to PHACTR1, APOE, and LPL, explaining 2.49%, 1.59%, and 1.46% of genetic variance respectively. Pathway analyses revealed that genetic risk in CAD is driven by dysregulation of three interlinked biological processes: 1) lipoprotein function and cholesterol metabolism, 2) vascular homeostasis, and 3) cellular stress responses and inflammation. These findings advance the causal understanding of CAD pathogenesis, supporting the transition from association-based to functionality-informed genomic approaches in cardiovascular genetics.
Pacar, I.; Ungaro, M. T.; Chen, Y.; Dallali, H.; Medico, J. A.; Hebbar, P.; Diekhaus, M.; Di Tommaso, E.; Geleta, M.; Chan, P. P.; Lowe, T. M.; Balacco, J.; Jain, N.; Ackerman, F.; Mochi, M.; Ioannidis, A. G.; Sawarkar, N.; Diaz, K.; Krishna Sudhakar, K.; Powell, J. E.; Jain, M.; Rosa, A.; Croft, G. F.; Tanzer, A.; Jarvis, E. D.; Formenti, G.; Salama, S. R.; Giunta, S.
Advances in DNA sequencing and assembly technologies are spurring a shift from haploid reference genomes to sample-specific diploid assemblies. Here, we generated the first telomere-to-telomere (T2T) diploid reference for the widely used human embryonic stem cell (hESC) line, H9 (WAe009-A). This haplotype-resolved assembly is highly accurate with comprehensive annotation of genes, segmental duplications, methylation, and chromatin conformation. Pangenomic and phased-locus inference point to H9's mixed ancestry with a predominant European component. H9-specific genomic features include near-perfect telomeres ~1.65-fold longer than other T2T assemblies, consistent with telomerase activity during pluripotency; chromosome 17 inversions that can predispose offspring to neurological syndromes; and expansions of ncRNA clusters, with overall genomic stability maintained despite extensive culturing. Mapping multi-omic datasets to the genome, we demonstrate the power of this resource for allele-specific, high-precision transcriptomic, genetic, and epigenetic analyses, with far-reaching implications for human development and disease.
Ravid, A.; Ladany, H.; Gusev, A.; Maruvka, Y. E.
Cancer development is shaped by somatic mutational processes that leave characteristic patterns known as mutational signatures. The inherited determinants of variability in signature activity remain largely unknown. Common germline variants that regulate this activity, which we term Signature Quantitative Trait Loci (SigQTLs), are expected to have modest individual effects, requiring cohorts of tens of thousands of samples for reliable detection. Clinical targeted-panel sequencing datasets achieve this scale, but present a fundamental challenge: individual tumors typically harbor too few mutations for stable signature inference. To overcome this sparsity barrier, we introduce GroupSig, a framework that aggregates sparse mutational patterns across samples sharing a germline genotype into information-rich meta-samples, enabling robust signature inference at the population level. We validated GroupSig by recovering the well-established correlations between age and clock-like signatures SBS1 and SBS5 using emulated panel data from The Cancer Genome Atlas (TCGA). We then applied GroupSig to approximately 32,000 tumor samples from the Dana-Farber Cancer Institute PROFILE cohort in a genome-wide SigQTL scan. We identified 9 genome-wide significant SigQTLs, with the strongest signal at locus 16q24.3, where six variants were associated with increased SBS7 (UV exposure) activity. This association persisted after excluding melanoma samples, arguing against a tumor-type enrichment artifact. Validation in TCGA confirmed 6 SigQTLs, all at 16q24.3, where implicated variants are eQTLs for CDK10 and SPG7 in skin tissue. Beyond genome-wide hits, DNA repair genes were 12.6-fold enriched among sub-threshold signals, supporting a polygenic architecture for mutational process regulation. GroupSig provides a scalable framework for germline-somatic association studies using panel sequencing data.